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The promise of epigenome-wide association studies and cancer-specific somatic DNA methylation changes in improving our 
understanding of cancer, coupled with the decreasing cost and increasing coverage of DNA methylation microarrays, has brought 
about a surge in the use of these technologies. Here, we aim to provide both a review of issues encountered in the processing and 
analysis of array-based DNA methylation data and a summary of the advantages of recent approaches proposed for handling 
those issues, focusing on approaches publicly available in open-source environments such as R and Bioconductor. We hope that 
the processing tools and analysis flowchart described herein will facilitate researchers to effectively use these powerful DNA 
methylation array-based platforms, thereby advancing our understanding of human health and disease. 



Epigenetic mechanisms associated with DNA methylation of 
cytosine residues at CpG dinucleotides have a central role in 
normal human development and disease (Baylin and Jones, 2011). 
Advancements in high-throughput assessment of DNA methyla- 
tion using microarrays or second-generation sequencing-based 
approaches have enabled the quantitative profiling of DNA 
methylation of CpG loci throughout the genome. As well as 
profiling the methylome of tumour compared with normal tissue, 
this has ushered in the era of epigenome-wide association studies, 
analogous to the genome-wide association studies, aimed at 
understanding the epigenetic basis of complex diseases such as 
cancer. The promise of methylation profiling in improving our 
understanding of cancer coupled with the current trend of 
decreasing cost and increasing coverage of DNA methylation 
microarrays has brought about a surge in the use of these 
technologies. 

Here, we aim to provide both a review of issues encountered in 
the processing and analysis of array-based DNA methylation data 



and a summary of recent approaches proposed for handling those 
issues. Excellent reviews in the field of epigenetics and technical 
aspects of array-based assessment of DNA methylation are 
available, although this is a constantly developing research area 
(Laird, 2010; Petronis, 2010; Baylin and Jones, 2011; Rakyan et al, 
2011; Bock, 2012). We seek to update perspectives on statistical 
issues that arise in the processing and analysis of array-based DNA 
methylation data (Siegmund, 2011), highlighting more recent 
methods proposed for this purpose. The subheadings shown in 
Figure 1 form the basis for the topics highlighted in this review. 
Our aim is to help researchers understand the growing body of 
statistical methods for array-based DNA methylation data, 
focusing on those freely available in open-source environments 
such as R or Bioconductor (Table 1). For this review, we chose to 
focus on Illumina's BeadArray assays; however, many of the 
general considerations described here are applicable to other array 
technologies. We also aim to counter some of the perceived 
limitations of these arrays, that is, there are too many 'false 
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Figure 1. Methylation array data processing and analysis pipeline. 



Table 1. R/Bioconductor packages for the processing and analysis of array-based DNA methylation data 




DNA methylation processing/analysis step 


R/Bloconductor packages 


Quality control samples 


IMA, HumMethQCReport, methylkit, MethyLumi, preprocessing and analysis pipeline, minfi 


Quality control probes 


IMA, HumMethQCReport, lumi, LumiWCIuster, preprocessing and analysis pipeline, wateRmelon 


Background correction 


Limma, lumi, MethyLumi, minfi, preprocessing and analysis pipeline 


Normalisation 


Combat^, HumMethQCReport, lumi, minfi, TurboNorm, MethyLumi, wateRmelon 


Type 1 and 2 probe scaling 


IMA, minfi, wateRmelon 


Batch/plate/chip/confounder adjustment 


Combat^ CpGassoc, ISVA, MethLAB 


Data dimension reduction 


MethyLumi 


Differential methylation analysis/region-based analysis 


CpGAssoc, IMA, limma, methylkit, MethLAB, MethVisual, minfi, EVORA 


Clustering/profile analysis 


Lumi, ISVA, HumMeth27QCReport, methylkit, RPMM, SS-RPMM^' 


Multiple testing correction 


CpGAssoc, methylkit, MethLAB, NHMMfdr 


^Freely available for download: http://www.bu.edu/jlab/wp-assets/ComBat/Abstract.html. 
'^Freely available for download: http://bio-epi.hitchcock.org/faculty/koestler.html. 



positives' in analysing microarray data (loannidis, 2007). We 
present the viewpoint that appropriate experimental design and 
downstream data processing and analysis pipelines will enable 
DNA methylation to be appropriately analysed and will help in 
understanding the pathogenesis of human disease. 



ILLUMINA BEADARRAY TECHNOLOGY FOR 
METHYLATION 



Illumina adapted its BeadArray technology for genotyping to 
recognise bisulphite-converted DNA for the interrogation of DNA 
methylation (Bibikova et al, 2011). The Illumina BeadArray assays 
use oligonucleotides conjugated to bead types to measure specific 



target sequences, measuring multiple beads per bead type. The 
bead types are summarised by the average signal for methylated 
(M) and unmethylated (U) alleles, and are used to compute the 
-value, where 

_ Max(M, 0) 

^ ~ Max(M, 0)+Max(L7, 0)+100 

A j^-value of 0 equates to an unmethylated CpG site and 1 to a 
fully methylated CpG site. Illumina has developed three platforms 
for array-based assessment of DNA methylation: GoldenGate, 
Infmium Human Methylation27 and the Infinium HD 450K 
methylation array, which all use two fluorescent dye colours but 
differ in the chemistries used to recognise the bisulphite- converted 
sequence; however, we will focus on the Infinium arrays for the rest 
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of this work, as the GoldenGate array has been phased out from 
production. Furthermore, Illumina has developed their Genome- 
Studio software (Bibikova et al, 2011), which enables basic data 
analysis; however, for more in-depth analysis, many tools have 
been developed, as we will discuss below. 



QUALITY CONTROL OF SAMPLES 



The Infinium arrays include several control probes for determining 
the data quality, including sample-independent and -dependent 
controls (Illumina, 2011). To detect poorly performing samples in 
Illumina arrays, diagnostic plots of control probes in Genome- 
Studio are often used (Bibikova et at, 2011), and the R-package 
HumMethQCReport (Mancuso et al, 2011) also provides these 
plots. Figure 2 shows hybridisation and bisulphite conversion plots 
for 450K data in the green channel. Although the sample- 
independent and -dependent controls can be visually inspected 
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Figure 2. Quality control example from GenomeStudio 450K data. 

(A) Hybridisation quality control plot in the green channel. (B) Bisulphite 
conversion quality control plot in the green channel. In this example, 
the separation between high and low values indicates that hybridisation 
worked well. Furthermore, bisulphite conversion also performed well as 
converted controls have a higher signal than unconverted controls. 



to identify poor performing samples, an alternative approach 
involves using the raw signal intensities of the control probes and 
determining whether they are beyond the expected range 
(e.g., median ± 3 s.d.) of the signal intensities across all samples. 

Other options for quality control of samples, which make use of 
detection P-values, are available in R and Bioconductor packages, 
such as the preprocessing and analysis pipeline (Touleimat and 
Tost, 2012), IMA (Wang et al, 2012), Minfi (Hansen and Aryee, 
2013) and MethyLumi (Davis et al, 2011). 



QUALITY CONTROL OF PROBES 



Similar to sample quality control, it is customary to filter probes if 
a certain proportion of samples (i.e., >25%) have a detection 
P- value below a certain prespecified threshold (i.e., P<0.05) 
(Bibikova et al, 2011). In the IMA package (Wang et al, 2012), 
probes with missing values, those residing on the X chromosome, 
and those with a median detection P- value > 0.05 across samples 
can be filtered out; other packages allowing such filtering include 
(Davis et al, 2011; Touleimat and Tost, 2012). 

LumiWCluster (Kuan et al, 2010) includes a function for model- 
based clustering of methylation data using a weighted likelihood 
approach wherein higher- quality samples (i.e., those with a low 
median detection P-value) have larger weights and thus greater 
influence on the estimation of the mixture parameters for cluster 
inference. This approach avoids discarding probes, characteristic of 
hard- thresholding approaches, allowing the incorporation of all the 
data while accounting for the quality of individual observations. 

A potential issue for quality control at the probe level stems 
from certain probes targeting CpG loci, which include single- 
nucleotide polymorphisms (SNPs) near or within the probe 
sequence or even in the target CpG dinucleotide; in fact, there 
may be up to 25% probes on the 450K array that are affected by an 
SNP (Liu et al, 2013). As methylation levels of a specific locus 
may be influenced by genotype (Dedeurwaerder et al, 2011a), 
investigators may want to remove those SNP-associated loci from 
their data, and several R packages have options for carrying this 
out (Touleimat and Tost, 2012; Wang et al, 2012). Genetic effects, 
however, should not be underestimated in methylation arrays. As 
was recently demonstrated in Fraser et al (2012), a large portion of 
population- specific DNA methylation levels may in fact be due to 
population- specific genetic variants, which are themselves affected 
by genetic or environmental interactions. Although rare SNPs are 
unlikely to affect methylation levels to a large extent, somatic 
mutations can impact methylation levels greatly, such as driver 
mutations in a tumour; hence, the importance of subsequent 
sequencing validation. 

Additional probes that a researcher may want to remove from 
their data include the 'Chen probes'. This is evidenced in a recently 
published paper showing that there may be spurious cross - 
hybridisation of Infinium probes on the 450K array and further 
suggesting that cross -hybridisation to the sex chromosomes may 
account for the large gender effects that researchers have found on 
the autosomal chromosomes (Chen et al, 2013). Finally, a number 
of SNP probes are also included on the Infinium array that can 
help identify mislabelled samples, as implemented in wateRmelon 
(Pidsley et al, 2013). 



BACKGROUND CORRECTION 



Background correction is platform specific, helps to remove 
nonspecific signal from total signal and corrects for between-array 
artefacts. Although this can be performed using Illumina's 
GenomeStudio, several R packages contain background correction 
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functions. This includes the preprocessing and analysis pipeline for 
450K data (Touleimat and Tost, 2012), providing background-level 
correction using lumi (Du, 2008), and furthermore Limma 
(Wettenhall and Smyth, 2004) and MethyLumi (Davis et a/, 
2011). Background can also be estimated by direct estimation from 
the density modes of the intensities measured by each probe. 
However, the latter has been shown to produce aberrant DNA 
methylation profiles, so using negative control probes may be 
preferred (Touleimat and Tost, 2012). One can also use Minfi 
(Hansen and Aryee, 2013) as a background estimation method; 
however, the authors acknowledge that this method may result in 
differing values compared with those estimated via GenomeStudio. 



NORMALISATION 



Normalisation concerns the removal of sources of experimental 
artefacts, random noise and technical and systematic variation 
caused by microarray technology, which, if left unaddressed, has 
the potential to mask true biological differences (Sun et al, 2011a). 
Two different types of normalisation exist: (1) between-array 
normalisation, removing technical artefacts between samples on 
different arrays, and (2) within-array normalisation, correcting for 
intensity-related dye biases (Siegmund, 2011). 

Owing to the features of DNA methylation, there is a lack of 
consensus regarding the optimal approach for normalisation of 
methylation data. Specifically, there is an imbalance in methylation 
levels throughout the genome creating a skewness to the 
methylation log- ratio distribution; the degree of this skewness is 
dependent on the levels of methylation in particular samples 
(Siegmund, 2011). This imbalance is due to the non-random 
distribution of CpG sites throughout the genome and the link 
between CpG density and DNA methylation; for instance, CpG 
islands (CGI) are often unmethylated, whereas the opposite 
relationship is typically seen in non-CGIs in normal human cells 
(Baylin and Jones, 2011). Furthermore, total fluorescence signal is 
inversely related to DNA methylation levels (Siegmund, 2011). 
Many available normalisation methods were designed for gene 
expression array data and are based on assumptions that may not 
be appropriate for DNA methylation microarray data. 

GenomeStudio provides an internal control normalisation 
method for the 450K assay (Illumina, 2008), which is also used 
in MethyLumi (Davis et al, 2011) and Minfi (Hansen and Aryee, 
2013); by default, GenomeStudio uses the first sample in the array 
as the reference and allows the user to reselect the reference sample 
as needed if the original sample is nongenomic or of poor quality. 

Quantile normalisation is one of the most commonly used 
normalisation techniques. Locally weighted scatterplot smoothing 
(LOESS) normalisation is an intensity- dependent normaUsation 
method that assumes independence between the difference in log 
fluorescence signals between two samples and the average of the 
log signals from the two dyes (Siegmund, 2011). Quantile and 
LOESS normalisation (Laird, 2010) assume similar total signal 
across samples and can therefore remove true biological signal, 
because of the nature of DNA methylation described above, and 
have assumptions unlikely to hold for methylation data. As the 
Infinium I and II probe types examine different subsets of the 
genome, described in detail below, quantile normalisation cannot 
be applied indiscriminantly across probe types. 

Lumi (Du, 2008), also used in HumMethQCReport (Mancuso 
et a/, 2011), offers an alternative to quantile normalisation through 
a robust spline normalisation, which is designed to normalise 
variance-stabilised data by combining features of both quantile and 
LOESS normaUsation (Du, 2008). Another approach, subset 
quantile normalisation (Wu and Aryee, 2010), normalises the data 
on the basis of on a subset of negative control or CpG-free probes 



that are independent of DNA methylation but suffers the same 
issues as other quantile approaches. The TurboNorm R package 
(van Iterson et al, 2012) provides an alternative to LOESS 
normalisation using a weighted P-spline intensity-dependent 
normalisation technique and can be applied to two colour arrays. 
A more recent method (Johnson et al, 2007), which we describe in 
more detail below, performs both normalisation and batch- effect 
correction. A comparison of different normalisation pipelines for 
Illumina 450K data can be found in two recent publications 
(Marabita et al, 2013; Pidsley et al, 2013). 



TYPE I AND II PROBE SCALING 



Another potential methodological concern stems from the fact that 
the 450K array uses two different types of probes, prompting the 
recommendation of rescaling to make the probe distributions 
comparable (Bibikova et al, 2011). Specifically, the 450K array has 
485 577 probes, of which 72% use the Infinium type II primer 
extension assay where the unmethylated (red channel) and 
methylated (green channel) signals are measured by a single bead 
(Bibikova et al, 2011). The remainder use the Infinium type I 
primer extension assay (also used in the 27K Infinium array) where 
the unmethylated and methylated signals are measured by different 
beads in the same colour channel (Bibikova et al, 2011). 
Importantly, the two probes differ in terms of CpG density, with 
more CpGs mapping to CpG islands for type I probes (57%) as 
compared with type II probes (21%) (Bibikova et al, 2011). 
Moreover, compared with Infinium I probes, the range of j^-values 
obtained from the Infinium II probes is smaller; in addition, the 
Infinium II probes also appear to be less sensitive for the detection 
of extreme methylation values and display a greater variance 
between replicates (Dedeurwaerder et al, 2011a). 

The divergence in the methylation distribution range has 
implications for statistical analysis of the array data. For example, 
in a supervised analysis of all probes, an enrichment bias towards 
type I probes may be created when ranking probes because of the 
higher range of type I probes (Maksimovic et al, 2012). In addition, 
region-based analyses assume that probes within those regions are 
comparable, potentially untenable because of the diverging 
chemistries on the 450K array (Maksimovic et al, 2012). Moreover, 
when performing profile analyses or clustering, the differing 
chemistries between the two probes types may drive the clustering 
solution. 

Attempts have been made to use rescaling to 'repair' the 
divergence between these two types of probes. The first correction 
method proposed was peak-based correction (Dedeurwaerder et al, 
2011a), implemented in IMA (Wang et al, 2012), wherein the 
Infinium II data is rescaled on the basis of the Infinium I data 
assuming a bimodal shape of the methylation density profiles. 
However, several researchers have noted that this method is 
sensitive to variation in the shape of DNA methylation density 
curves and does not work well when the density distribution does 
not exhibit well-defined peaks or modes (Pan et al, 2012; 
Touleimat and Tost, 2012; Teschendorff et al, 2013). 

Three alternative approaches have been proposed recently to 
address the limitations of the peak base correction approach. The 
first, subset-quantile within-array normalization (Maksimovic et al, 
2012), is available in Minfi. Subset-quantile within-array normal- 
ization determines an average quantile distribution using a subset 
of probes defined to be biologically similar on the basis of CpG 
content and allows the Infinium I and II probes to be normalised 
together (Maksimovic et al, 2012). 

The second, subset quantile normalisation (Touleimat and 
Tost, 2012), uses the genomic location of CpGs to create probe 
subgroups through which they apply subset quantile normalisation. 
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The reference quantiles used in this approach are based on type I 
probes with significant detection P-values (Touleimat and Tost, 
2012). 

Finally, the j^-mixture quantile dilation normalisation method, 
implemented in the wateRmelon package (Pidsley et al, 2013), uses 
quantiles to normalise the type II probe values into a distribution 
comparable to the type I probes using a j^-mixture model fit to the 
type I and type II probes separately and then transforms the 
probabilities of class membership of the type II probes into 
quantiles (j^-values) using the parameters of the j^- distributions of 
the type I distribution (Teschendorff et al, 2013). This method uses 
a three- state j^-mixture model but does not use fit to the middle 
'hemimethylated' component in the normalisation; therefore, it 
does not require a trimodal distribution (Teschendorff et al, 2013). 
An advantage of BMIQ is that it avoids selecting subsets of probes 
matched for biological characteristics as done in the previous 
method and was found to be the best algorithm for reducing probe 
design bias in a recent paper (Marabita et al, 2013). 

Rescaling using the methods mentioned above may be 
unnecessary when analysing 450K data on a CpG-by-CpG basis 
because the comparisons will be made at the individual probe level. 



ADJUSTMENT BATCH/PLATE/CHIP/OTHER 
CONFOUNDERS 



DNA methylation arrays are susceptible to batch effects: technical 
remnants that are not associated with the biological question but 
with unrelated factors such as laboratory conditions or experiment 
time (Leek et al, 2010; Sun et al, 2011b). Normalisation has been 
shown to reduce some component of batch effects, although not all 
(Teschendorff et al, 2009; Leek et al, 2010; Sun et al, 2011b). Sound 
study design is critical for proper evaluation of and correction for 
batch effects: for instance, samples from different study groups 
should be split randomly or equally into different batches (Johnson 
et al, 2007). By properly correcting for batch effects, one can 
combine data from multiple batches, enabling greater statistical 
power to measure a specific association of interest (Johnson et al, 
2007). 

Several methods have been proposed to adjust for batch effects. 
ComBat uses an empirical Bayes procedure for this (Johnson et al, 
2007), is robust to outliers in small sample sizes and can adjust for 
other potential confounders along with batch (Sun et al, 2011b). 
However, this method can be computationally burdensome and 
was initially developed for gene expression data; therefore, it 
requires a transformation of methylation data, which follows the 
j5-distribution, to satisfy the assumption of normality. 

Other R packages exist to adjust for batch effects. MethLAB 
(Kilaru et al, 2012) and CpG assoc (Barfield et al, 2012) allow the 
adjustment for batch using a mixed-effects model framework. 
However, because these methods do not directly adjust the data, 
unlike ComBat, they should be used only for a locus -by-locus 
analysis. 

The array literature indicates that array position effects may also 
exist (van Eijk et al, 2012), and thus new batch correction 
techniques may be needed to take those into account. When 
phenotype distribution is heterogeneous across chips, which can 
occur in small samples even after randomisation, methods such as 
ComBAT can fail; in this case, linear mixed-effects models treating 
chip effects as random is an alternative. 

However, in certain cases, the true sources of batch effects or 
confounding are unknown or cannot be adequately modelled 
statistically (Leek et al, 2010). In such cases, two methods, 
surrogate variable analysis (SVA) (Leek and Storey, 2007) and 
independent surrogate variable analysis (ISVA) (Teschendorff et al, 
2011), also available as the ISVA R package, are very useful. 
Surrogate variable analysis estimates the source of batch effects 



directly from the array data, and variables estimated with SVA 
(SVs) can then be included into the statistical model as covariates 
(Leek and Storey, 2007). A modified version of SVA, ISVA, 
identifies features correlating with the phenotype of interest in the 
presence of potential or unknown confounding factors, which are 
modelled as statistically independent surrogate variables or ISVs 
(Teschendorff et al, 2011). This method could also be used for 
batch effects by constructing ISVs that are associated with these as 
potential confounders and including them in the analytical model. 
A problem with this technique occurs when the ISVs correlate both 
with the phenotype of interest and with the potential confounders, 
making model covariate selection difficult. Furthermore, ISVA and 
SVA do not directly adjust the methylation data, like ComBAT 
does, which may be problematic if the analytical goal is clustering. 
One could, however, fit a model with the estimated SVs or ISVs 
and compute the residuals for subsequent analyses. 



DOWNSTREAM ANALYSIS 



Methylation status. Average P or the j^-value is a commonly used 
metric to denote the level or percentage of methylation for an 
interrogated locus. Investigators also use the M-value, or log ratio, 
to measure methylation (Du et al, 2010): 



M = log^ 



Max(M, 0) 
Max([7,0) 



A normalised M-value near 0 signifies a semimethylated locus, a 
positive M-value indicates that more molecules are methylated 
than unmethylated, whereas negative M-values have the opposite 
interpretation (Du et al, 2010). An M-value is attractive in that it 
can been used in many statistical models derived for expression 
arrays that assume normality (Du et al, 2010). However, values 
are much more biologically interpretable than their counterpart; 
furthermore, a recent paper found supervised principal compo- 
nents analysis (SPCA), as described below, to work better in the 
context of j^-values as opposed to M-values (Zhuang et al, 2012). 
The relationship between the P- and M-value is captured by 
(Du et al, 2010): 



M = log2 



1-/ 



Differential methylation/region-based analysis. Locus -by-locus 
analyses examine the relationship between a phenotype of interest 
and methylation of individual CpG sites across the genome, 
seeking to find differentially methylated sites. Differential methyla- 
tion analysis aims to determine methylation differences between 
the specific groups (such as cases and controls), such as probe- wise 
or locus-specific methylation differences; the two terminologies are 
therefore equivalent when at the individual locus level. A very 
simple example is Delta B (Bibikova et al, 2011; Touleimat and 
Tost, 2012), where a difference is applied to two groups' 
methylation medians for each CpG locus; if the absolute value of 
the difference in medians across samples of each group is higher 
than 0.2, then that locus is considered to be differentially 
methylated. This 0.2 threshold corresponds to the recommended 
difference in methylation between samples that can be detected 
with 99% confidence (Bibikova et al, 2011). MethVisual (Zackay 
and Steinhoff, 2010) tests whether each CpG site has independent 
membership between two groups using Fisher's exact test; other 
packages include (Wettenhall and Smyth, 2004; Barfield et al, 2012; 
Kilaru et al, 2012; Wang et al, 2012), some allowing for the 
adjustment of potential confounders (Barfield et al, 2012; Kilaru 
et al, 2012; Wang et al, 2012). Minfi (Hansen and Aryee, 2013) uses 
linear regression and an F-test to test for a univariate association 
between the methylation of individual loci and continuous or 
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categorical phenotypes, respectively. When sample sizes are < 10, 
Minfi (Hansen and Aryee, 2013) has options for using limma 
(Wettenhall and Smyth, 2004). Specifically, limma uses an 
empirical Bayes moderated t-test, computed for each probe, which 
is similar to a ^-test, except that the standard errors have been 
shrunk towards a common value. M- values should be used in these 
cases as, being based on a Bayesian Gaussian model, they will rely 
much more heavily on the Gaussianity assumption (Zhuang et al, 

2012) . The IMA package (Wang et al, 2012) allows site 
(methylation locus) -specific and region (all loci in a gene) -specific 
differential methylation analysis using Student's t-test and 
empirical Bayes statistics. For region analysis, IMA will compute 
the mean, median or Tukey's biweight robust average for the loci 
within that region and create an index (Wang et al 2012). 
Methylkit (Akalin et al, 2012) allows for analysis at the site or 
regional level using logistic regression or Fisher's exact test. With 
multiple samples per group, methylkit will preferentially use 
logistic regression, enabling also the inclusion of potential 
confounders (Akalin et al, 2012); to get stable estimates of the 
regression coefficients in logistic regression about 10 events per 
variable are necessary (Peduzzi et al, 1996). 

Differential methylation analysis can also be performed by 
measuring variability between methylation loci as opposed to using 
statistical tests on the basis of differences in mean methylation 
(Xu et al, 2013). This is available in the EVORA package, allowing 
an investigator to use differential variability in methylation of 
CpGs and to then associate them with a phenotype of interest, such 
as cancer status (Teschendorff and Widschwendter, 2012; Xu et al, 

2013) . 

As noted in several recent works, nearby CpG loci tend to have 
methylation levels that are highly correlated (Leek et al, 2010). As a 
result, statistical analyses that assume independence may be 
problematic. Methods are being developed to deal with this 
potential problem and include bump -hunting techniques (Leek 
and Storey, 2007), which take into account CpG proximity and 
borrow strength across neighbouring probes. Although these 
approaches were originally developed for CHARM assays, they 
may be adapted to the less-dense 450K array, pending careful 
attention to the tuning parameters for defining a 'region'. 

Although the above methods have proved successful in 
identifying individual CpG sites that associate with some 
phenotype/exposure of interest, the extent to which the methyla- 
tion of these sites reflect true changes to the methylome or 
represent heterogeneity in underlying cell-type distributions 
depends largely on the tissue being sampled (Teschendorff et al, 
2009; Houseman et al, 2012). We recently developed a set of 
statistical methods that exploit the use of leukocyte-specific DMRs 
for inferring changes in cell mixture proportions based solely on 
peripheral blood profiles of DNA methylation (Houseman et al, 

2012) . Under certain constraints, this approach can be used to 
approximate the underlying distribution of cell proportions among 
samples comprising a heterogeneous mixture of cell populations 
with distinct DNA methylation profiles (Houseman et al, 2012). 
This method has recently been used for predicting cell-type 
proportions, which were then subsequently added as additional 
covariate terms in a differential methylation analysis of rheumatoid 
arthritis cases/controls (Liu et al, 2013). Furthermore, the methods 
of Houseman et al (2012) were recently validated using a publicly 
available data set (Lam et al, 2012) that consisted of both PBMC- 
derived DNA methylation profiles and complete blood cell (CBC) 
counts for 94 healthy, non-diseased adult subjects (Koestler et al, 

2013) . 

Clustering/profile analysis. Clustering refers to the grouping of 
objects into clusters, such that the objects within the same cluster 
are more similar compared with objects in different clusters. 
Owing to the interest in identifying molecular subtypes in the 



context of cancer, clustering has become a staple technique in the 
analysis of array-based DNA methylation data. 

Two very well-known non-hierarchical methods used to cluster 
DNA methylation include K-means and K-medoids, also known as 
partitioning around medoids or PAM (Pollard et al, 2005). Two 
disadvantages of K-means are that it requires the prespecification 
of the number of classes, which is not often known; furthermore, 
K-means create clusters based only on the first moment, which is 
problematic in cases where the variance of a specific probe contains 
biologically important information. Another commonly used 
method to detect patterns in methylation data is PCA, which is a 
latent variable method often appUed as a dimension reduction 
procedure and used for the detection of batch effects (Jolliffe, 
2002). Principal component analysis was first applied to genome- 
wide Infmium HumanMethylation27 DNA methylation data as 
shown in Teschendorff et al (2009). Principal component analysis 
is used to develop a smaller number of artificial variables, called 
principal components, which account for most of the variance in 
the observed variables of a data set (Jolliffe, 2002); usually only the 
first few components are kept as potential predictors for statistical 
modelling (Jolliffe, 2002). However, additional principal compo- 
nents may be of biological significance as shown in Teschendorff 
et al (2009). A method to estimate the number of significant PCA 
components is available in the ISVA package (Teschendorff et al, 
2011). This algorithm is based on the Random Matrix Theory 
(Plerou et al, 2002), which can be used to estimate the number of 
significant PCA components that are subsequently examined for 
their association with study- specific characteristics. Random 
Matrix Theory estimates the number of significant components 
of a data covariance matrix by comparing the statistics of the 
observed eigenvalues obtained from PCA with those obtained from 
a random matrix. The main disadvantage with PCA lies in the poor 
interpretability of the resulting principal components and the 
requirement of a large sample size in order to obtain reliable 
results. 

Another well-known clustering method is hierarchical cluster- 
ing, which builds a binary tree by successively merging similar 
samples or probes based on a measure of similarity (Eisen, 1998). 
However, because of its unsupervised nature, this form of 
clustering may or may not predict a phenotype of interest, as it 
does not use data beyond methylation to form clusters. Lumi 
(Du, 2008), HumMeth27QCReport (Mancuso et al, 2011) and 
methylkit (Akalin et al, 2012) all provide hierarchical clustering 
and PCA options using normalised M-values. 

In addition to non-parametric techniques for clustering or 
profile analysis. Houseman et al (2008) developed a recursive- 
partitioning mixture model (RPMM), an unsupervised, model- 
based, hierarchical clustering methodology for array-based DNA 
methylation data. Recursive -partitioning mixture model assumes a 
j^-mixture model to split samples between subgroups and provides 
an estimate for the number of clusters; furthermore, it is 
computationally efficient relative to the standard finite mixture 
model approach (Houseman et al, 2008). Owing to the inherent 
correlation in the methylation status of nearby CpG sites, there 
have also been efforts to incorporate correlation structures based 
on the proximity of CpGs in the context RPMM (Leek et al, 2010). 

Semisupervised methods use both array-based genomic data 
and clinical data for identifying profiles that are associated with a 
clinical variable of interest, such as survival. Semisupervised 
clustering (SS-Clust) begins by identifying a set of genes that 
correlate with a phenotype of interest, followed by unsupervised 
clustering of samples based on the set of genes (Bair and 
Tibshirani, 2004). Supervised principal components analysis uses 
a similar methodology to SS-Clust, but replaces unsupervised 
clustering with PCA, providing a 'risk score' for each patient, 
which is then used as a continuous predictor of survival (Jolliffe, 
2002). Semisupervised clustering's main disadvantage is that it 
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requires prespecification of the number of clusters; moreover, 
SPCA inherits the interpretabiUty issues characteristic of PGA. 
Semisupervised RPMM (Koestler et al, 2010) has been shown to 
outperform SS-Clust and SPCA under certain circumstances and 
does not require the prespecification of the number of clusters. 

One of the first attempts to discover novel tumour classes 
through profiling of methylation data involved a supervised 
method called support vector machine (SVM) including a cross- 
validation method to evaluate its prediction performance (Adorjan 
et al 2002). This approach was initially very computationally 
intensive but was a precursor to other profile analysis methods. 
Another method, Elastic net, is a shrinkage and selection method, 
which produces a sparse model with good prediction accuracy, 
while encouraging a grouping effect (Zou and Hastie, 2005); this 
algorithm is now being widely used on all types of omics data 
(Barretina et al, 2012; Hannum et al, 2013) and was compared with 
SVM and SPCA in Zhuang et al, 2012 and shown to be far 
superior. 

Pathway analysis. Many researchers use pathway analysis to 
characterise the function of the gene in which the individual or 
group of loci are found. Several software packages do this; however, 
we focus on two freely available resources that can also be used in 
R. The Gene Ontology (GO) provides a very detailed representa- 
tion of functional relationships between biological processes, 
molecular function and cellular components across eukaryotic 
biology (Ashburner et al, 2000). Another resource that borrows 
heavily from GO is PANTHER (Thomas et al, 2003), which relates 
protein sequence relationships to functional relationships. How- 
ever, many commonly used pathway analysis methods are based on 
gene expression correlation or protein-protein interaction; 
although pathway perturbations are likely to be evident in 
expression changes across all genes of a pathway, a single well- 
placed alteration of DNA methylation, acting as an epigenetic 
switch, may alter all downstream mRNA expression. In light of 
this, sensitivity for detecting significant pathways is lower for DNA 
methylation than it might be for mRNA expression. In addition, 
unlike mRNA expression, CpGs have different implications for 
expression depending on where they exist in relation to a gene or if 
they are mapped to any gene at all. As the 450K array has great 
heterogeneity with respect to the CpG representation by gene 
region, there is the potential for pathway analysis on 450K data to 
be biased by CpG selection. In addition, as genes are not equally 
covered throughout the array through the number of probes in 
their specific regions, this may further bias this analysis. Therefore, 
in using such approaches, we recommend stratification by gene 
region (e.g., promoter) to decrease the potential for bias. Once a 
specific region has been chosen, then pathway analysis, GSEA, or 
integration with interaction networks could be a fruitful procedure, 
as recently demonstrated in Dedeurwaerder et al (2011b) and West 
et al (2013). 



MULTIPLE TESTING CORRECTION 



Once the analysis has identified top hits, multiple testing correction 
is necessary to reduce the likelihood of identifying false-positive 
loci by adjusting statistical confidence measures by the number of 
tests performed. Bonferroni correction consists of multiplying each 
probability by the total number of tests performed; this controls the 
family- wise error rate (Holm, 1979). 

A less -conservative, widely used, approach involves controlling 
the FDR {q-yahie) or the expected proportion of false discoveries 
among the discoveries; this also uses a sequential P-value method 
(Benjamini et al, 2001); several R packages allow for the adjustment 
of the FDR (Barfield et al, 2012; Kilaru et al, 2012; Wang et al, 
2012). All of the aforementioned methods assume statistical 



independence of the multiple tests, which can be violated when 
tests exhibit strong correlations (as mentioned above); further- 
more, (^-values imply subsequent validation in an independent 
sample, which may not occur. A potential solution to this 
independence assumption is with the use of permutation testing 
in which the phenotype of interest is randomly re-assigned, and the 
data reanalysed. CpG assoc provides a permutation testing option 
to obtain empirical P- values (Barfield et al, 2012). 



VALIDATION OF SIGNIFICANT HITS 



The final step in the proper processing and analysis of DNA 
methylation arrays is validation of significant hits by an 
independent experimental approach or data resource. The gold 
standard is bisulphite sequencing-based methods, such as pyr- 
osequencing (Ammerpohl et al, 2009) and Epityper (Laird, 2010), 
to provide high-throughput quantitation (Siegmund, 2011). 
Another valuable resource for validation (and exploration) of 
DNA methylation array data is publicly available repositories such 
as the Gene Expression Omnibus (Edgar et al, 2002). Finally, with 
the availability of data resources such as the above and HAPMAP 
(Altshuler et al, 2010), researchers can now integrate their 
methylation array data with these resources, to help further 
understand molecular and genomic profiles that contribute to 
outcomes of interest such as cancer risk. 



CONCLUSIONS 



Owing to the plethora and complexity of methods for array 
processing and analysis, described above, and to the multitude of 
researchers using DNA methylation arrays, there is a need to create 
a protocol of good practice to ensure that study results are of the 
highest quality possible. Just as gold standard laboratory methods 
are crucial to the generation of quality biological data, gold 
standard processing and analytical methods are equally as 
important. Through the proper use of the processing and analysis 
flowchart described above, we hope that potential users will best 
harness these powerful array-based tools, which will in turn lead to 
rapid discoveries in human health and disease. 
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